Graph Labelling Workshop and Web Spam Challenge
Abstract
We compare a wide range of semi-supervised learning techniques for both Web spam filtering and telephone user churn classification. Semi-supervised learning assumes that the label of a node in a graph is similar to the labels of its neighbors. In this paper we measure this phenomenon for both Web spam and telco churn. We conclude that spam is often linked to spam while honest pages are linked to honest ones; similarly, churn occurs in bursts within groups of a social network.
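The neighbor-similarity assumption can be illustrated with a minimal Python sketch. This is not the method evaluated in the paper: the toy graph, the labels, and the function names `neighbor_label_agreement` and `predict_by_neighbors` are all hypothetical, and edges are assumed to be undirected. The first function measures how often two labeled, linked nodes share a label; the second guesses the label of an unlabeled node by a majority vote over its labeled neighbors.

```python
from collections import Counter


def neighbor_label_agreement(edges, labels):
    """Fraction of edges with two labeled endpoints whose labels match."""
    same = total = 0
    for u, v in edges:
        if u in labels and v in labels:
            total += 1
            same += labels[u] == labels[v]
    return same / total if total else float("nan")


def predict_by_neighbors(node, edges, labels):
    """Majority label among the labeled neighbors of `node`, or None."""
    votes = Counter()
    for u, v in edges:
        if u == node and v in labels:
            votes[labels[v]] += 1
        elif v == node and u in labels:
            votes[labels[u]] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy, made-up graph: hosts a-c are spam, d-f are honest, x is unlabeled.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),  # spam hosts linking to each other
    ("d", "e"), ("e", "f"),              # honest hosts linking to each other
    ("c", "d"),                          # one cross-class link
    ("x", "a"), ("x", "b"),              # unlabeled host linked to spam hosts
]
labels = {"a": "spam", "b": "spam", "c": "spam",
          "d": "honest", "e": "honest", "f": "honest"}

agreement = neighbor_label_agreement(edges, labels)
prediction = predict_by_neighbors("x", edges, labels)
print(f"label agreement on labeled edges: {agreement:.2f}")
print(f"predicted label for x: {prediction}")
```

An agreement value well above what random labeling would give suggests that propagating labels along links is reasonable for the graph at hand; a value near chance suggests that the assumption does not hold there.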
Related resources
Fast Asynchronous Anti-TrustRank for Web Spam Detection
Web spam detection is an important problem in Web search. Since Web spam pages tend to have many spurious links, many Web spam detection algorithms exploit the hyperlink structure between Web pages to detect spam pages. The Anti-TrustRank algorithm is a well-known link-based spam detection algorithm which follows the principle that spam pages are likely to be referenced by other spam pa...
Web Spam Challenge 2007 Track II Secure Computing Corporation Research
To discriminate spam Web hosts/pages from normal ones, text-based and link-based data are provided for Web Spam Challenge Track II. Given a small fraction of labeled nodes (about 10%) in a Web linkage graph, the challenge is to predict whether each of the other nodes is spam or normal. We extract features from link-based data, and then combine them with text-based features. After feature scaling, Support Vec...
A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion
This paper describes a machine learning approach for detecting web spam. Each example in this classification task corresponds to 100 web pages from a host and the task is to predict whether this collection of pages represents spam or not. This task is part of the 2007 ECML/PKDD Graph Labeling Workshop’s Web Spam Challenge (track 2). Our approach begins by adding several human-engineered feature...
A Survey on Web Spam and Spam 2.0
In the current scenario, the web is huge, highly distributed, open in nature, and changing rapidly. The open nature of the web is the main reason for its rapid growth, but it has also posed a challenge to information retrieval. One of the biggest challenges is spam. We study the different forms of web spam and its new variant called spam 2.0, and existing detection methods proposed by differe...
Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection
This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and difficult to solve, mostly due to the large size of web collections that makes many algorithms unfeasible in practice. For spam detection we app...
Publication date: 2007